As part of our phase 1 analysis, we functionally annotated the phase 1 variants against coding annotation from GENCODE and non-coding annotation from the ENCODE project.
This functional annotation can be found in our phase 1 analysis results directory. We provide both the annotation sets that the variants were compared against and VCF files containing the functional consequences for each variant.
The 1000 Genomes Project was divided into stages. Initially, a set of pilot projects were undertaken, followed by the main project, which was broken into three phases.
The initial part of the Project was called the pilot project. This was split into three pilot studies: the low coverage pilot (pilot 1), the high coverage pilot (pilot 2) and the exon targeted pilot (pilot 3). This data was completed in 2009 and published in Nature in 2010. All of the data associated with the pilot projects is available in the pilot_data directory on the FTP site.
Phase 1 represented low coverage and exome data analysis for the first 1092 samples. The phase 1 low coverage alignments and exome alignments are available in the phase 1 directory on the FTP site. Analysis of phase 1 was published in 2012. The analysis results associated with the paper can be found in the phase 1 analysis_results directory. The low coverage sequence data from phase 1 is listed in the 20101123 sequence index and the exome data in the 20110521 sequence index.
During phase 2 the set of samples expanded to around 1700. The sequence data is represented in the 20111114 sequence index. This data was used for method development, both to improve on existing methods from phase 1 and to develop new methods to handle features like multi-allelic variant sites and true integration of complex variation and structural variants.
Phase 3 represents 2504 samples, including additional African samples and samples from South Asia. The methods developed in phase 2 were applied to this data set and a final catalogue of variation was released on the FTP site. These results were published in two publications in 2015, one covering the main project and the other focusing on structural variation.
LDAF is an allele frequency value in the info column of our phase 1 VCF files.
Our standard AF values are allele frequencies calculated from the allele count (AC) and allele number (AN) values and rounded to 2 decimal places. LDAF is the allele frequency as inferred from the haplotype estimation.
You will note that LDAF does sometimes differ from the AF calculated on the basis of allele count and allele number. This generally means there are many uncertain genotypes for this site. This is particularly true close to the ends of the chromosomes.
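The comparison between AF and LDAF can be sketched in a few lines of Python. This is a minimal illustration, not part of the project's own tooling: the function names are hypothetical, and it assumes a biallelic site whose INFO column carries single-valued AC, AN and LDAF keys.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
        else:
            fields[item] = True
    return fields


def af_versus_ldaf(info):
    """Return (AF rounded to 2 places, LDAF, absolute difference).

    Assumes one alternative allele, so AC holds a single integer.
    The difference is taken against the unrounded AC/AN ratio.
    """
    fields = parse_info(info)
    ac = int(fields["AC"])
    an = int(fields["AN"])
    ldaf = float(fields["LDAF"])
    raw_af = ac / an
    return round(raw_af, 2), ldaf, abs(ldaf - raw_af)


# Illustrative INFO string; the values are made up for this example.
af, ldaf, diff = af_versus_ldaf("AC=134;AN=2184;LDAF=0.0649;VT=SNP")
print(af, ldaf, round(diff, 4))
```

A large difference at a site is a hint that many of its genotypes were uncertain. What counts as large is left to the reader, as no threshold is given here.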
The phase 1 variant list released in 2012 and the phase 3 variant list released in 2014 overlap, but phase 3 is not a complete superset of phase 1. Comparing the two releases by variant position shows that 2.3M phase 1 sites are not present in phase 3. Of these 2.3M sites, 1.92M are SNPs; the rest are either indels or structural variants (SVs).
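A position-based comparison of this kind can be sketched as a set difference. The helper names below are hypothetical, and the sketch assumes both site lists are VCF files on the same coordinate system, keyed only by chromosome and position; it is not the project's actual comparison code.

```python
def site_keys(vcf_lines):
    """Collect (chromosome, position) pairs from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and header lines, which start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        keys.add((chrom, int(pos)))
    return keys


def sites_missing(phase1_lines, phase3_lines):
    """Return phase 1 positions that have no phase 3 site at the same position."""
    return site_keys(phase1_lines) - site_keys(phase3_lines)


# Tiny made-up example: one of the two phase 1 positions is absent from phase 3.
phase1 = ["#CHROM\tPOS\tID\tREF\tALT", "1\t100\trs1\tA\tG", "1\t200\trs2\tC\tT"]
phase3 = ["#CHROM\tPOS\tID\tREF\tALT", "1\t100\trs1\tA\tG"]
print(sites_missing(phase1, phase3))
```

For real files, any iterable of lines works, for example `gzip.open(path, "rt")` for a compressed VCF. Holding every position of a genome-wide release in memory may be costly, so processing one chromosome at a time is a reasonable variation.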
The difference between the two lists has several causes.
1. Some phase 1 samples were not used in phase 3 for various reasons. If a sample was not part of phase 3, variants private to that sample are not part of the phase 3 set.
2. Our input sequence data is different. In phase 1 we had a mixture of read lengths, from 36bp to >100bp, and a mixture of sequencing platforms: Illumina, ABI SOLiD and 454. In phase 3 we used only data from the Illumina sequencing platform, with read lengths of 70bp or more. We believe that these calls are higher quality, and that variants excluded this way were probably not real.
The first two reasons listed explain 548k missing SNPs, leaving 1.37M SNPs still to be explained.
3. The phase 1 and phase 3 variant calling pipelines are different. Phase 3 had an expanded set of variant callers, including haplotype-aware variant callers and variant callers that used de novo assembly. It considered low coverage and exome sequence together rather than independently. Our genotype calling was also different, using ShapeIt2 and MVNcall, allowing the integration of multi-allelic variants and complex events, which was not possible in phase 1.
891k of the 1.37M phase 1 SNPs still missing from phase 3 were not identified by any phase 3 variant caller. These 891k SNPs have a relatively high Ts/Tv ratio (1.84), which suggests they were missed in phase 3 because they are very rare, not because they are wrong. The increase in sample number in phase 3 made it harder to detect very rare events, especially if the extra 1400 samples in phase 3 did not carry the alternative allele.
The other 481k of the 1.37M SNPs were initially called in phase 3. Of these, 340k failed our initial SVM filter and so were not included in our final merged variant set, 57k overlapped with larger variant events and so were not accurately called, and 84k did not make it into our final set of genotypes due to losses in our pipeline. Some of these sites will be false positives, but we have no strong evidence as to which of them are wrong and which were lost for other reasons.
4. The reference genomes used for our alignments are different. Phase 1 reads were aligned to the standard GRCh37 primary reference including unplaced contigs. In phase 3 we added EBV and a decoy set to the reference to reduce mismapping. This will have reduced the number of false positive SNP calls caused by mismapped reads. We cannot quantify this effect.
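The SNP accounting in the list above can be checked with some simple arithmetic. The snippet below only restates the rounded figures quoted in the text; the variable names are illustrative, and the exact counts behind the rounded values may differ slightly.

```python
# Rounded figures quoted in the text (exact counts may differ slightly).
missing_snps = 1_920_000             # phase 1 SNPs absent from phase 3
explained_by_reasons_1_2 = 548_000   # dropped samples and changed input data

remaining = missing_snps - explained_by_reasons_1_2  # about 1.37M

never_called = 891_000               # not found by any phase 3 caller
initially_called = {
    "failed_svm_filter": 340_000,
    "overlapped_larger_variant": 57_000,
    "lost_in_pipeline": 84_000,
}
called_total = sum(initially_called.values())  # 481k

# The two groups together account for all of the remaining SNPs.
assert never_called + called_total == remaining
print(remaining, called_total)
```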
We have made no attempt to elucidate why our SV and indel numbers changed. Since the release of the phase 1 data, the algorithms to detect and validate indels and SVs have improved dramatically. By and large, we assume that the indels and SVs in phase 1 that are missing from phase 3 are false positives in phase 1.
You can get more details about our comparison from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/phase1_sites_missing_in_phase3/